ImageNet-1K dataset
Appendix: KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training

Appendix A. Proof of Lemma 1
Table 1 summarizes the models and datasets used in this work. ImageNet-1K (Deng et al., 2009): we use a subset of the ImageNet dataset. DeepCAM (Kurth et al., 2018): a dataset for image segmentation. Fractal-3K (Kataoka et al., 2022): a dataset rendered with the Visual Atom method; we also follow the training setting of Kataoka et al. (2022). Table 2 shows the details of our hyper-parameters. Specifically, we follow the TorchVision guideline to train ResNet-50, which uses a cosine learning-rate schedule. To show the robustness of KAKURENBO, we also train ResNet-50 with different settings; for the ResNet-50 (A) setting, we follow the hyper-parameters reported in Goyal et al. (2017). It is worth noting that KAKURENBO merely hides samples before the input pipeline. In this section, we present an analysis of the factors affecting KAKURENBO's performance. The results show that our method dynamically hides samples at each epoch.
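The per-epoch sample hiding described above can be sketched as follows. This is a simplified illustration, not KAKURENBO's actual selection rule (which also considers prediction history); the lowest-loss criterion, the function name, and the `hide_fraction` parameter are assumptions for the sketch.

```python
def select_visible_indices(losses, hide_fraction):
    """Hide a fraction of samples for the coming epoch, before the
    input pipeline ever sees them; return the indices to train on.

    `losses` holds per-sample losses from the previous epoch;
    `hide_fraction` is the fraction of the dataset to hide
    (both are illustrative names, not KAKURENBO's real interface).
    """
    n_hide = int(len(losses) * hide_fraction)
    # Rank sample indices by loss, ascending: easiest samples first.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    hidden = set(order[:n_hide])
    # Everything not hidden stays visible, in original dataset order.
    return [i for i in range(len(losses)) if i not in hidden]

losses = [0.1, 2.3, 0.05, 1.7, 0.9]
visible = select_visible_indices(losses, hide_fraction=0.4)  # → [1, 3, 4]
```

Because the selection happens before the input pipeline, the training loop itself is unchanged; only the sampler's index list shrinks each epoch.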
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting (Supplemental Material)
Ziyi Wang, Xumin Yu, Yongming Rao, Jie Zhou, Jiwen Lu
During the geometry-preserved projection, several points may fall in the same pixel. We therefore ablate the pooling strategy in Table 3, comparing max-pooling, mean-pooling, and summation. The classification ablation results show that summation outperforms both max-pooling and mean-pooling. After migrating the pre-trained image models to point cloud analysis with Point-to-Pixel Prompting, we report the number of trainable parameters (Tr. We choose 4 segments of ϕ.
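The three pooling strategies compared in the ablation can be sketched as below. The function name and the flat scalar features are illustrative assumptions; in the real pipeline the aggregation runs over per-point feature vectors on a 2D pixel grid.

```python
def pool_points_to_pixels(points, strategy="sum"):
    """Aggregate features of points that project onto the same pixel.

    `points` is a list of (pixel_index, feature) pairs; the pixel index
    is assumed to be precomputed by the geometry-preserved projection.
    """
    buckets = {}
    for pix, feat in points:
        buckets.setdefault(pix, []).append(feat)
    pooled = {}
    for pix, feats in buckets.items():
        if strategy == "max":
            pooled[pix] = max(feats)
        elif strategy == "mean":
            pooled[pix] = sum(feats) / len(feats)
        else:  # "sum": the variant that performs best in the ablation
            pooled[pix] = sum(feats)
    return pooled

pts = [(0, 1.0), (0, 3.0), (1, 2.0)]
pool_points_to_pixels(pts, "sum")   # → {0: 4.0, 1: 2.0}
pool_points_to_pixels(pts, "mean")  # → {0: 2.0, 1: 2.0}
```

Summation differs from the other two in that it preserves point density: a pixel hit by many points gets a proportionally larger response, which mean- and max-pooling discard.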
- South America > Peru > Loreto Department (0.05)
- Asia > China (0.05)
A. ImageNet Texture
See Figures 7 and 8 for examples of the ImageNet-Texture dataset and their counterparts in the original ImageNet dataset. Shape is often less well-defined in these classes, for example in window screen and rapeseed.

B.1 Comparison of two ways to apply α in NCE loss

Since the denominator normalizes the three kinds of pairs equally, we only pay attention to the numerator. Because of the exponential tail, it applies an exponentially larger weight to the negatives that are harder. Our patch-based augmentation is also closely related to self-supervised learning methods that solve jigsaw puzzles as the pretext task. All of our models are trained on 4 GTX 1080 Ti GPUs.
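The two generic ways of applying a weight α to a similarity term in an NCE-style loss can be contrasted as follows. This is a hedged sketch, not the paper's exact formulation: it only illustrates why placing α inside the exponent gives exponentially larger weight to harder (higher-similarity) negatives, while placing it outside rescales all negatives uniformly.

```python
import math

def weight_inside_exponent(sim, alpha):
    # alpha scales the similarity inside the exponent: exp(alpha * sim).
    # The weight ratio between two negatives grows exponentially with
    # their similarity gap, so hard negatives dominate.
    return math.exp(alpha * sim)

def weight_outside_exponent(sim, alpha):
    # alpha multiplies the exponentiated similarity: alpha * exp(sim).
    # Every negative is rescaled by the same constant, so the relative
    # weighting between easy and hard negatives is unchanged.
    return alpha * math.exp(sim)

# Ratio of a hard negative (sim=2) to an easy one (sim=1), alpha=2:
inside = weight_inside_exponent(2.0, 2.0) / weight_inside_exponent(1.0, 2.0)
outside = weight_outside_exponent(2.0, 2.0) / weight_outside_exponent(1.0, 2.0)
# inside ≈ e^2 ≈ 7.39, outside ≈ e ≈ 2.72: the inside placement
# amplifies hard negatives much more strongly.
```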
FocusDD: Real-World Scene Infusion for Robust Dataset Distillation
Hu, Youbing, Cheng, Yun, Saukh, Olga, Ozdemir, Firat, Lu, Anqi, Cao, Zhiqiang, Li, Zhijun
Dataset distillation has emerged as a strategy to compress real-world datasets for efficient training. However, it struggles with large-scale and high-resolution datasets, limiting its practicality. This paper introduces a novel resolution-independent dataset distillation method, Focused Dataset Distillation (FocusDD), which achieves diversity and realism in distilled data by identifying key information patches, thereby ensuring the generalization capability of the distilled dataset across different network architectures. Specifically, FocusDD leverages a pre-trained Vision Transformer (ViT) to extract key image patches, which are then synthesized into a single distilled image. These distilled images, which capture multiple targets, are suitable not only for classification tasks but also for dense tasks such as object detection. To further improve the generalization of the distilled dataset, each synthesized image is augmented with a downsampled view of the original image. Experimental results on the ImageNet-1K dataset demonstrate that, with 100 images per class (IPC), ResNet50 and MobileNet-v2 achieve validation accuracies of 71.0% and 62.6%, respectively, outperforming state-of-the-art methods by 2.8% and 4.7%. Notably, FocusDD is the first method to use distilled datasets for object detection tasks. On the COCO2017 dataset, with an IPC of 50, YOLOv11n and YOLOv11s achieve 24.4% and 32.1% mAP, respectively, further validating the effectiveness of our approach.
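The synthesis step described in the abstract — keep the patches a pre-trained ViT scores highest, compose them into one distilled image, and append a downsampled view of the original — can be sketched schematically. All names and shapes here are illustrative assumptions: patches are flat number lists standing in for pixel blocks, and the scores stand in for ViT attention.

```python
def compose_distilled_image(patches, scores, n_keep, thumbnail):
    """Assemble a distilled 'image' from the highest-scoring patches
    plus a downsampled view of the original image.

    `patches`: candidate patches (flat lists standing in for pixels).
    `scores`: one importance score per patch (stand-in for ViT attention).
    `thumbnail`: the downsampled view appended for generalization.
    """
    # Rank patch indices by score, descending, and keep the top n_keep.
    ranked = sorted(range(len(patches)), key=lambda i: -scores[i])
    kept = [patches[i] for i in ranked[:n_keep]]
    # Concatenate kept patches, then append the downsampled view.
    return [v for p in kept for v in p] + thumbnail

patches = [[1, 1], [2, 2], [3, 3]]
scores = [0.2, 0.9, 0.5]
compose_distilled_image(patches, scores, n_keep=2, thumbnail=[9])
# → [2, 2, 3, 3, 9]
```

The point of the sketch is the data flow, not the arithmetic: the distilled image is a mosaic of the most informative regions, with the low-resolution global view retained so no whole-image context is lost.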
MILAN: Masked Image Pretraining on Language Assisted Representation
Hou, Zejiang, Sun, Fei, Chen, Yen-Kuang, Xie, Yuan, Kung, Sun-Yuan
Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model qualities heavily depend on excessively large labeled image datasets. In order to reduce the reliance on large labeled datasets, reconstruction-based masked autoencoders are gaining popularity, which learn high-quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from text captions accompanying the images. In this work, we propose masked image pretraining on language assisted representation, dubbed MILAN. Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct the image features with substantial semantic signals that are obtained using caption supervision. Moreover, to accommodate our reconstruction target, we propose a more effective prompting decoder architecture and a semantic-aware mask sampling mechanism, which further advance the transfer performance of the pretrained model. Experimental results demonstrate that MILAN delivers higher accuracy than previous works. When the masked autoencoder is pretrained and finetuned on the ImageNet-1K dataset with an input resolution of 224x224, MILAN achieves a top-1 accuracy of 85.4% on ViT-Base, surpassing the previous state of the art by 1%. In the downstream semantic segmentation task, MILAN achieves 52.7 mIoU using ViT-Base on the ADE20K dataset, outperforming previous masked pretraining results by 4 points.
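The core objective change in MILAN — regress language-assisted features of the masked patches rather than their raw pixels — can be sketched as a masked feature-reconstruction loss. This is a simplified stand-in: in the real method the targets come from a caption-supervised image encoder and the features are vectors, whereas here scalars and the function name are assumptions for illustration.

```python
def feature_reconstruction_loss(pred_feats, target_feats, mask):
    """Mean squared error between predicted features and the
    language-assisted target features, computed only over masked
    patch positions (unmasked patches contribute nothing).

    `mask[i]` is truthy where patch i was masked during pretraining.
    """
    total, count = 0.0, 0
    for pred, target, masked in zip(pred_feats, target_feats, mask):
        if masked:
            total += (pred - target) ** 2
            count += 1
    # Guard against an all-visible batch; normally count > 0.
    return total / max(count, 1)

# Only positions 1 and 2 are masked, so position 0 is ignored.
feature_reconstruction_loss([1.0, 2.0, 3.0], [1.0, 0.0, 5.0], [0, 1, 1])
# → 4.0
```

Swapping pixel targets for semantic feature targets is what lets the masked autoencoder inherit caption-level semantics without needing captions at finetuning time.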